We present a new analysis of the Sloan Digital Sky Survey data aimed at producing a detailed map of the
nearby (z < 0.5) universe. Using neural networks trained on the available spectroscopic base of knowledge, we
derived distance estimates for ~30 million galaxies distributed over ~8,000 sq. deg. We also used unsupervised
clustering tools developed in the framework of the VO-Tech project to investigate the possibility of understanding
the nature of each object present in the field and, in particular, to produce a list of candidate AGNs and QSOs.



1. Introduction 

Recent developments in wide-field digital detectors, dedicated survey telescopes and computer
science promise to fulfill, within a few years, one of the oldest dreams of the scientific
community, namely the production of an accurate taxonomy of the nearby universe. At a very basic
level, such a taxonomy will consist of a detailed description of both positions and types for all
objects matching well defined selection criteria (e.g. flux limited or volume limited samples).
Even at this basic level, it will be of the utmost relevance for many fields of cosmology and
astroparticle physics [1,2]. In what follows we briefly outline the first results of our ongoing
efforts to produce a 3-D map of the nearby universe with a characterization of galaxy types in a
few, broadly defined, categories: normal (early and late type) galaxies, AGN, QSO, etc. This work
takes place in the framework of the European VO-Technological Infrastructures project (VO-Tech,
[4]). We used as input data the Sloan Digital Sky Survey Data Release 4 and/or 5 (hereafter
SDSS4/5; [5]), which provides photometric data in 5 bands for several hundred million galaxies
distributed over ~8,000 sq. deg. and additional spectroscopic information for a subsample of about
1 million extragalactic objects.

2. The photometric redshift 

To evaluate photometric redshifts we made use of an improved version [B] of the Neural Networks
(NNs) method presented in [7,8]. We adopted a two-step approach, with both steps accomplished using
the NN architecture known as MLP (Multi Layer Perceptron): first we trained an MLP to separate
nearby (i.e. with redshift z < 0.25) from distant (z > 0.25) objects, then we trained two separate
MLPs to work in the two different redshift regimes. This approach is strongly supported by the fact
that, in the SDSS-5 catalogue, the distribution of galaxies inside the two redshift intervals is
dominated by two different galaxy populations: the Main Galaxy (MG) sample in the nearby region,
and the Luminous Red Galaxies (LRG) in the distant one. The use of two separate networks ensures
that the NNs achieve good generalization capabilities in the nearby sample, leaving the biases
mainly in the distant one.

To perform the separation between MG and LRG objects, we extracted from the SDSS-4 Spectroscopic
Subsample (hereafter SpS) training, validation and test sets containing, respectively, 60%, 20% and
20% of the total number of objects (449,370 galaxies). The resulting test set therefore consisted
of 89,874 randomly extracted objects. After this first step, the evaluation of photometric
redshifts was performed separately in the two regimes. Errors were then evaluated on the test set
by measuring the dispersion of points in the $z_{spec}$ vs $z_{phot}$ plane, i.e. the variance of
the $z_{spec} - z_{phot}$ variable, after applying an interpolative correction for residual
systematics.

Figure 1. Distribution of spectroscopic versus photometric redshifts in the test set. Lighter grey
points mark LRG galaxies. Notice the larger scatter of non-LRGs in the distant (z > 0.25) sample.
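The bookkeeping behind the two-step scheme can be illustrated with a minimal Python sketch. It splits a catalogue of 449,370 objects into 60%/20%/20% training, validation and test sets, and routes an object to a "nearby" or "distant" regressor according to a first-stage classifier. The names `split_indices` and `two_step_photoz` and the three toy models are hypothetical placeholders chosen for this example; they only show the data flow and are not the networks used in this work.

```python
import random

def split_indices(n_objects, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition range(n_objects) into train/validation/test lists."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    n_train = round(n_objects * fractions[0])
    n_valid = round(n_objects * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever is left goes to the test set
    return train, valid, test

def two_step_photoz(colours, is_nearby, nearby_model, distant_model):
    """Step 1: classify the object as nearby or distant.
    Step 2: apply the regressor trained for that redshift regime."""
    model = nearby_model if is_nearby(colours) else distant_model
    return model(colours)

# Hypothetical stand-ins for the three trained networks (arbitrary rules and
# constants, assumed only for illustration).
def toy_gate(colours):
    return sum(colours) < 3.0

def toy_nearby(colours):
    return 0.10

def toy_distant(colours):
    return 0.40

train, valid, test = split_indices(449370)
print(len(train), len(valid), len(test))
print(two_step_photoz((0.5, 0.5, 0.3, 0.2), toy_gate, toy_nearby, toy_distant))
```

With these fractions the test list holds 89,874 indices, the size quoted in the text for the test set.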

Our results can be summarized as follows: 

• For the MG sample, the robust variance turned out to be $\sigma_3 = 0.0208$ over the whole
redshift range, and 0.0197 and 0.0245 for the nearby and distant objects, respectively.

• For the LRG sample we obtained $\sigma_3 \simeq 0.0163$ over the whole range, and
$\sigma_3 \simeq 0.0154$ and $\sigma_3 \simeq 0.0189$ for the nearby and distant samples,
respectively.
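The text does not spell out how the robust variance quoted above is computed. One common choice, assumed here purely for illustration, is the standard deviation of the residuals between spectroscopic and photometric redshifts after iteratively discarding points lying more than three standard deviations from the mean. The function name `clipped_sigma` and the toy redshifts below are hypothetical.

```python
import statistics

def clipped_sigma(z_spec, z_phot, n_sigma=3.0, max_iter=20):
    """Standard deviation of z_spec - z_phot after iterative n-sigma clipping."""
    residuals = [s - p for s, p in zip(z_spec, z_phot)]
    for _ in range(max_iter):
        mean = statistics.mean(residuals)
        sigma = statistics.pstdev(residuals)
        kept = [r for r in residuals if abs(r - mean) <= n_sigma * sigma]
        if len(kept) == len(residuals):
            break  # nothing was rejected, so the estimate has converged
        residuals = kept
    return statistics.pstdev(residuals)

# Toy data: 20 well-behaved objects plus one catastrophic outlier.
z_spec = [0.10] * 21
z_phot = [0.09, 0.11] * 10 + [0.60]
print(clipped_sigma(z_spec, z_phot))
```

In this toy case the single outlier is rejected in the first pass, so the returned value reflects only the scatter of the 20 well-behaved objects (about 0.01).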



3. The clustering 

The implementation of a reliable classification scheme requires the partitioning of the observed
parameter space in clusters of objects sharing some underlying common physical property. Obviously,
since there is no a priori reason to assume that photometric classifications must reflect strictly
any morphological classification, any effective classification method must be unsupervised and
partition the photometric parameter space using only the statistical properties of the data
themselves.

Most unsupervised methods, however, require the number of clusters to be provided a priori. This
represents a serious problem when exploring large and complex data sets, where the number of
clusters can be very high or, in any case, unpredictable. A simple threshold criterion is not
satisfactory in most astronomical applications, because the high degeneracy and the noisiness of
the data lead to the erroneous agglomeration of data. We therefore implemented a hierarchical
approach which starts from a preliminary clustering performed with the unsupervised clustering
algorithm known as "Probabilistic Principal Surfaces", or PPS, described in [12,13], and then makes
use of the Negative Entropy concept and of a dendrogram structure to agglomerate the clusters found
in the first phase [B].
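The idea of agglomerating first-level clusters through a dendrogram and then choosing where to cut it can be sketched in a few lines of Python. The merging rule used here is plain single linkage with Euclidean distance, swapped in only for illustration; it is not the Negative Entropy criterion, which the text does not specify. The helper `widest_plateau` encodes one simple, assumed reading of a plateau criterion: pick the number of clusters that stays unchanged over the widest range of thresholds. All function names and the toy cluster centres are hypothetical.

```python
import math

def single_linkage_heights(points):
    """Merge clusters pairwise (single linkage); return the merge distances in order."""
    clusters = [[tuple(p)] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
        heights.append(d)
    return heights

def n_clusters_at(heights, threshold):
    """Clusters remaining when only merges with distance <= threshold are applied."""
    return len(heights) + 1 - sum(1 for h in heights if h <= threshold)

def widest_plateau(heights):
    """Return (n_clusters, gap) for the widest gap between consecutive merge distances."""
    best_n, best_gap = None, -1.0
    for k in range(len(heights) - 1):
        gap = heights[k + 1] - heights[k]
        if gap > best_gap:
            best_n, best_gap = len(heights) - k, gap  # clusters left after k + 1 merges
    return best_n, best_gap

# Toy first-level cluster centres in a one-dimensional parameter space.
centres = [(0,), (1,), (2,), (50,), (51,), (100,)]
heights = single_linkage_heights(centres)
print(heights)
print(widest_plateau(heights))
```

For the toy centres, any threshold between the third and fourth merge distance leaves the same three groups, which is the kind of stability a plateau criterion looks for.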

We first applied the PPS algorithm to the sample of spectroscopically selected SDSS DR-4 objects,
using as clustering parameters the 4 colours obtained from model magnitudes (u-g, g-r, r-i, i-z).
We fixed the number of latent variables and latent bases of the PPS to 614 and 51 respectively,
thus obtaining at the end of this step 614 clusters, each formed by objects which respond only to a
certain latent variable. We chose a large number of latent variables in order to obtain an accurate
separation of objects and to prevent any group of distinct but nearby points in the parameter space
from being projected onto the same cluster by chance. These first-order clusters were then fed to
the NEC algorithm, which determined the final number of clusters. The plateau analysis of the
agglomerative process and the inspection of the dendrogram allowed us to set the threshold to a
value corresponding to 31 clusters. We present in Table 1 the ten most populous clusters together
with the distribution of the specClass index within each cluster. The additional 21 clusters
represent less than 10% of the objects and are still under investigation.

Table 1. Distribution of a subsample of objects in the most significant clusters. Columns
correspond to different values of the specClass index provided by the SDSS.

It needs to be emphasized that the clustering makes use of the photometric data only, and that the
spectroscopic information is used only to validate the clusters. As can be seen, galaxies (SP2)
clearly dominate clusters 1, 2 and 6. Whether or not this separation reflects some deeper
differences among the three groups (such as, for instance, different morphologies) cannot be
assessed on the grounds of presently available data. AGNs (SP3) dominate clusters 5 and 9, even
though some contamination from galaxies exists. Late type stars (SP6) populate mainly clusters 7
and 8 and are strong contaminants of cluster 10, which is also dominated by galaxies. It needs to
be stressed, however, that the use of the specClass as a label must be taken with some caution,
since it is becoming increasingly evident that this index suffers from severe biases [11].
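The validation step described above, in which spectroscopic labels are compared with photometrically derived clusters, amounts to building a contingency table of cluster versus specClass. A minimal Python sketch follows; the function names and the toy labels are hypothetical and do not reproduce the actual catalogue or the contents of Table 1.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_ids, spec_classes):
    """Count how many objects of each specClass fall in each cluster."""
    table = defaultdict(Counter)
    for cluster_id, spec_class in zip(cluster_ids, spec_classes):
        table[cluster_id][spec_class] += 1
    return table

def dominant_class(counts):
    """Return the most frequent specClass in a cluster and its fraction."""
    label, count = counts.most_common(1)[0]
    return label, count / sum(counts.values())

# Toy labels only: 2 = galaxy, 3 = AGN, 6 = late type star, as used in the text.
cluster_ids = [1, 1, 1, 1, 5, 5, 5, 7, 7, 10, 10, 10]
spec_classes = [2, 2, 2, 3, 3, 3, 2, 6, 6, 2, 2, 6]

table = cluster_composition(cluster_ids, spec_classes)
for cluster_id in sorted(table):
    label, fraction = dominant_class(table[cluster_id])
    print(cluster_id, dict(table[cluster_id]), label, round(fraction, 2))
```

One minus the dominant fraction gives a simple contamination estimate per cluster, subject to the caveat on specClass biases raised in the text.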